pm566 midterm project

Author

Sabrina (Hsi-Hsuan) Yang

Introduction:

According to CDC, heart disease is the leading cause of death in the United States. 47% of the people in the United States have at least one of three key risk factors for heart disease: high blood pressure, high blood cholesterol, and smoking. Other key factors like diabetic status, BMI, not getting enough physical activity or drinking too much alcohol can also contribute to heart disease. Thus, detecting and preventing risk factors that have the greatest impact on heart disease is important to healthcare. In this project, the main factors that are being analyzed are BMI, sleep, physical health, diabetic status (yes or no), alcohol consumption (yes or no), smoking, physical activity. These factors are being analyzed to see the relationship between itself and the outcome (have heart disease or not) and determine whether these variables are statistically significant associated with heart disease. Multivariable regression analysis was also performed to evaluate the relationship between the heart disease and the combination of all the factors.

Data and indicators:

The data was from 2020 with 319795 data points and 18 variables. The dataset was obtained from Kaggle (https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease/data)

The heart disease health indicators based on the BRFSS 2015 Codebook (https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf) BMI: Body Mass Index Smoking: Have you smoked at least 100 cigarettes in your entire life? Physical activity: Adults who reported doing physical activity/exercise during the past 30 days other than their regular jobs Diabetic status: Were you ever told you have diabetes? Physical health: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?

library(data.table)
library(leaflet)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::between()     masks data.table::between()
✖ dplyr::filter()      masks stats::filter()
✖ dplyr::first()       masks data.table::first()
✖ lubridate::hour()    masks data.table::hour()
✖ lubridate::isoweek() masks data.table::isoweek()
✖ dplyr::lag()         masks stats::lag()
✖ dplyr::last()        masks data.table::last()
✖ lubridate::mday()    masks data.table::mday()
✖ lubridate::minute()  masks data.table::minute()
✖ lubridate::month()   masks data.table::month()
✖ lubridate::quarter() masks data.table::quarter()
✖ lubridate::second()  masks data.table::second()
✖ purrr::transpose()   masks data.table::transpose()
✖ lubridate::wday()    masks data.table::wday()
✖ lubridate::week()    masks data.table::week()
✖ lubridate::yday()    masks data.table::yday()
✖ lubridate::year()    masks data.table::year()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'

The following object is masked from 'package:R.methodsS3':

    throw

The following objects are masked from 'package:methods':

    getClasses, getMethods

The following objects are masked from 'package:base':

    attach, detach, load, save

R.utils v2.12.2 (2022-11-11 22:00:03 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'

The following object is masked from 'package:tidyr':

    extract

The following object is masked from 'package:utils':

    timestamp

The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(knitr)
library(skimr)
heart_initial<- read_csv("/Users/sabrinayang/Downloads/heart_2020_cleaned.csv")
Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(heart_initial)
Data summary
Name heart_initial
Number of rows 319795
Number of columns 18
_______________________
Column type frequency:
character 14
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
HeartDisease 0 1 2 3 0 2 0
Smoking 0 1 2 3 0 2 0
AlcoholDrinking 0 1 2 3 0 2 0
Stroke 0 1 2 3 0 2 0
DiffWalking 0 1 2 3 0 2 0
Sex 0 1 4 6 0 2 0
AgeCategory 0 1 5 11 0 13 0
Race 0 1 5 30 0 6 0
Diabetic 0 1 2 23 0 4 0
PhysicalActivity 0 1 2 3 0 2 0
GenHealth 0 1 4 9 0 5 0
Asthma 0 1 2 3 0 2 0
KidneyDisease 0 1 2 3 0 2 0
SkinCancer 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
BMI 0 1 28.33 6.36 12.02 24.03 27.34 31.42 94.85 ▇▅▁▁▁
PhysicalHealth 0 1 3.37 7.95 0.00 0.00 0.00 2.00 30.00 ▇▁▁▁▁
MentalHealth 0 1 3.90 7.96 0.00 0.00 0.00 3.00 30.00 ▇▁▁▁▁
SleepTime 0 1 7.10 1.44 1.00 6.00 7.00 8.00 24.00 ▁▇▁▁▁
#remove duplicated rows
duplicates <- duplicated(heart_initial)
duplicate_rows <- heart_initial[duplicates, ]
heart <- subset(heart_initial, !duplicated(heart_initial))

Data inspection includes: check dimensions using dim(heart), check the first few rows (headers) using head(heart), check the last few rows (footers) using tail(heart), and check variable names and types using str(heart).

Rename variables to shorter length and all lower case

heart<-
  heart %>%
  rename(
    heartdis = HeartDisease,
    bmi = BMI,
    smoking = Smoking,
    alc = AlcoholDrinking,
    stroke = Stroke,
    phys_H = PhysicalHealth,
    ment_H = MentalHealth,
    diff_walk = DiffWalking,
    sex = Sex,
    age = AgeCategory,
    race = Race,
    diabetic = Diabetic,
    phys_A = PhysicalActivity,
    health = GenHealth,
    sleep = SleepTime,
    asthma = Asthma,
    kidney = KidneyDisease,
    skincancer = SkinCancer
  )
#Is there any missing value
any(is.na(heart))
[1] FALSE

There is no missing value in the dataset.

Preliminary Results

#Categorize bmi into groups (category is based on CDC standard)
heart$bmi_group <- ifelse(heart$bmi < 18.5, "underweight",
                          ifelse(heart$bmi >= 18.5 & heart$bmi < 25, "healthy",
                                 ifelse(heart$bmi >= 25 & heart$bmi < 30, "overweight",
                                        ifelse(heart$bmi >= 30, "obese","not obese"))))
table(heart$bmi_group)

    healthy       obese  overweight underweight 
      90858      100344      105432        5083 
table(heart$heartdis)

    No    Yes 
274456  27261 
#Change yes, no into binary 1,0
library(dplyr)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
heart <- heart %>%
  mutate(heartdis_bi = if_else(heartdis == "Yes", 1, 0))
#Heart disease outcome by bmi groups in plot (data visualisation)
heart <-
  heart %>%
  mutate(outcome_heart = factor(heartdis_bi))
plot_bmi<-
  heart %>%
  ggplot()+
  geom_bar(mapping = aes(x = bmi_group, fill = outcome_heart))+
  labs(title = "Heart disease outcome by bmi groups")
ggplotly(plot_bmi)

The plot shows that most people in the study are either obese or overweight. Majority of the people in the study who have heart disease are in obese and overweight group, while the least people are in the underweight group.

#Is BMI a significant risk factor for heart disease?
#Regression model for bmi
overallbmi <- glm (outcome_heart ~ bmi, data = heart, family = binomial)
summary(overallbmi)

Call:
glm(formula = outcome_heart ~ bmi, family = binomial, data = heart)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.9874047  0.0273202  -109.3   <2e-16 ***
bmi          0.0234955  0.0009071    25.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 182416  on 301715  degrees of freedom
AIC: 182420

Number of Fisher Scoring iterations: 5

In this logistic regression model, the dependent variable is whether an individual has heart disease (0 for no heart disease, 1 for having heart disease) and the independent variable is BMI. For a one-unit increase in BMI, the estimated log-odds of having heart disease increase by 0.0234955 and the p-value being extremely close to zero (p< 2 x 10^-16) suggest that there is an evidence that BMI is associated with the likelihood of having heart disease.

Breaking BMI into categories makes it easier to interpret the model in terms of risk associated with different BMI categories.

#Regression model for bmi groups
reg_bmi<- glm (outcome_heart ~ bmi_group, data = heart, family = binomial)
summary(reg_bmi)

Call:
glm(formula = outcome_heart ~ bmi_group, family = binomial, data = heart)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -2.59859    0.01307 -198.83  < 2e-16 ***
bmi_groupobese        0.46838    0.01661   28.20  < 2e-16 ***
bmi_groupoverweight   0.33323    0.01680   19.84  < 2e-16 ***
bmi_groupunderweight  0.14108    0.05365    2.63  0.00855 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 182204  on 301713  degrees of freedom
AIC: 182212

Number of Fisher Scoring iterations: 5

Coefficients for BMI groups: looking at the effect of different BMI categories on the likelihood of a person having a heart disease compared to healthy group.

The response variable being 1 indicates a person having a heart disease, 0 indicates a person doesn’t have a heart disease. For the obese group, the coefficient of 0.46838 indicates that, compared to individuals in healthy group, individuals in obese group have about 0.46838 increase in log-odds of having heart disease. Same logic applies to individuals who are overweight (coefficient: 0.33323) and underweight (0.14108). All p-values associated with these coefficients are <0.05, which suggest that the effect of BMI categories on the likelihood of having heart disease is statistically significant. The coefficients show that there is a positive association between higher BMI categories and the likelihood of having heart disease.

library(mfp)
Loading required package: survival

Attaching package: 'survival'
The following object is masked _by_ '.GlobalEnv':

    heart
#heart disease and sleep time relationship
heart$sleep_group <- ifelse(heart$sleep < 7, "lack of sleep",
                          ifelse(heart$sleep >= 7, "enough sleep", "no sleep"))

model_sleep<- glm(outcome_heart ~ sleep_group, 
         family = binomial,
         data = heart)
summary(model_sleep)

Call:
glm(formula = outcome_heart ~ sleep_group, family = binomial, 
    data = heart)

Coefficients:
                          Estimate Std. Error  z value Pr(>|z|)    
(Intercept)              -2.344716   0.007781 -301.343  < 2e-16 ***
sleep_grouplack of sleep  0.109211   0.013468    8.109 5.11e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 182989  on 301715  degrees of freedom
AIC: 182993

Number of Fisher Scoring iterations: 5
#boxplot heart disease vs sleep hour 
heart %>% 
  filter(!(sleep %in% NA)) %>%
ggplot()+
  geom_boxplot(mapping = aes(y=sleep, x=outcome_heart))+
labs(title="heart disease vs sleep hour", x="heart disease (0=no, 1= yes)", y="sleep hours") 

#stacked histogram for heart disease vs sleep hour
sleep1 <- ggplot(heart, aes(x=sleep, color=outcome_heart, fill=outcome_heart)) +
  scale_fill_manual(values = c("skyblue", "pink")) +
  scale_colour_manual(values = c("black", "black")) +
  geom_histogram()
ggplotly(sleep1)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The results of the regression analysis for heart disease with sleep groups (<7 hours of sleep are indicated as “lack of sleep,” >=7 hours of sleep are indicated as “enough sleep”) suggest that the individuals in “lack of sleep” group have an estimated increase of 0.109211 in the log-odds of having heart disease compared to those with enough sleep, assuming all other factors are held constant. The increase is statistically significant as the p-value is 5.11 x 10^-16, which is less than .05. The stacked histogram shows that the majority of the individuals who have heart disease have an average sleep of 7.93 hours. Also based on the visualisation through the boxplot, it seems like there is no obvious difference between sleep hours and heart disease as the distributions for heart disease and no heart disease are relatively similar. Therefore, exploring the correlation to understand the relationship can provide more insights.

# Calculate the correlation between heart disease and sleep hours
library(polycor)
cor_sleep <- cor(heart$sleep, as.numeric(heart$outcome_heart))
print(cor_sleep)
[1] 0.01083366
model1<- glm(outcome_heart ~ sleep, 
         family = binomial,
         data = heart)
summary(model1)

Call:
glm(formula = outcome_heart ~ sleep, family = binomial, data = heart)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.490325   0.031157 -79.929  < 2e-16 ***
sleep        0.025466   0.004278   5.953 2.64e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 183019  on 301715  degrees of freedom
AIC: 183023

Number of Fisher Scoring iterations: 5

The correlation coefficient is very close to 0 (0.0108336), but the p-value is statistically significant (p<.0001). It implies that there is a significant relationship between the variables sleep hours and heart disease, but the strength of the relationship is weak. This situation can suggest that the sample size is large (the dataset contains 319795 data points), could consider other variables involved, and there might be a non-linear relationship as linear relationship is what the Pearson correlation coefficient measures.

The categorical variables below (physical health, diabetic status, alcohol status, smoking status, physical activity) were each ran through logistic regression model to see the association between the variable and the outcome variable (heart disease).

#heart disease vs physical health
mfp::mfp(outcome_heart ~ fp(phys_H), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(phys_H), data = heart, 
    family = binomial)


Deviance table:
         Resid. Dev
Null model   183053.8
Linear model     176719.9
Final model  176404.5

Fractional polynomials:
       df.initial select alpha df.final power1 power2
phys_H          4      1  0.05        4     -2      0


Transformations of covariates:
                                          formula
phys_H I(((phys_H+1)/10)^-2)+log(((phys_H+1)/10))

Coefficients:
Intercept   phys_H.1   phys_H.2  
-1.830579   0.005437   0.574534  

Degrees of Freedom: 301716 Total (i.e. Null);  301714 Residual
Null Deviance:      183100 
Residual Deviance: 176400   AIC: 176400 
glm(outcome_heart ~ phys_H, family = binomial, data = heart) %>% summary()

Call:
glm(formula = outcome_heart ~ phys_H, family = binomial, data = heart)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.5646860  0.0075732 -338.65   <2e-16 ***
phys_H       0.0495475  0.0005772   85.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 176720  on 301715  degrees of freedom
AIC: 176724

Number of Fisher Scoring iterations: 5

The result shows the association between heart disease vs physical health function is statistically significant (p<2x10^-16).

#heart disease vs diabetic status
heart <- heart %>%
  mutate(diabetic_status = if_else(diabetic == "Yes", 1, 0))
mfp::mfp(outcome_heart ~ fp(diabetic_status), family = binomial, data = heart)
Warning: glm.fit: algorithm did not converge
Call:
mfp::mfp(formula = outcome_heart ~ fp(diabetic_status), data = heart, 
    family = binomial)


Deviance table:
         Resid. Dev
Null model   183053.8
Linear model     175460.4
Final model  175460.4

Fractional polynomials:
                df.initial select alpha df.final power1 power2
diabetic_status          4      1  0.05        1      1      .


Transformations of covariates:
                                 formula
diabetic_status I((diabetic_status+1)^1)

Coefficients:
        Intercept  diabetic_status.1  
           -3.906              1.322  

Degrees of Freedom: 301716 Total (i.e. Null);  301715 Residual
Null Deviance:      183100 
Residual Deviance: 175500   AIC: 175500 
glm(outcome_heart ~ diabetic_status, family = binomial, data = heart) %>% summary()

Call:
glm(formula = outcome_heart ~ diabetic_status, family = binomial, 
    data = heart)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.584689   0.007663 -337.29   <2e-16 ***
diabetic_status  1.321665   0.014216   92.97   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 175460  on 301715  degrees of freedom
AIC: 175464

Number of Fisher Scoring iterations: 5

The result shows the association between heart disease vs diabetic status function is statistically significant (p<2x10^-16).

#heart disease vs alcohol status
heart <- heart %>%
  mutate(alc_status = if_else(alc == "Yes", 1, 0))
mfp::mfp(outcome_heart ~ fp(alc_status), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(alc_status), data = heart, 
    family = binomial)


Deviance table:
         Resid. Dev
Null model   183053.8
Linear model     182597.8
Final model  182597.8

Fractional polynomials:
           df.initial select alpha df.final power1 power2
alc_status          4      1  0.05        1      1      .


Transformations of covariates:
                       formula
alc_status I((alc_status+1)^1)

Coefficients:
   Intercept  alc_status.1  
     -1.6638       -0.6109  

Degrees of Freedom: 301716 Total (i.e. Null);  301715 Residual
Null Deviance:      183100 
Residual Deviance: 182600   AIC: 182600 
glm(outcome_heart ~ alc_status, family = binomial, data = heart) %>% summary()

Call:
glm(formula = outcome_heart ~ alc_status, family = binomial, 
    data = heart)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.274696   0.006498 -350.07   <2e-16 ***
alc_status  -0.610893   0.031105  -19.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 182598  on 301715  degrees of freedom
AIC: 182602

Number of Fisher Scoring iterations: 5

The association between heart disease vs alcohol status function is statistically significant (p<2x10^-16).

#heart disease vs smoking status
heart <- heart %>%
  mutate(smoke = if_else(smoking == "Yes", 1, 0))
mfp::mfp(outcome_heart ~ fp(smoke), family = binomial, data = heart)
Call:
mfp::mfp(formula = outcome_heart ~ fp(smoke), data = heart, family = binomial)


Deviance table:
         Resid. Dev
Null model   183053.8
Linear model     179804.8
Final model  179804.8

Fractional polynomials:
      df.initial select alpha df.final power1 power2
smoke          4      1  0.05        1      1      .


Transformations of covariates:
             formula
smoke I((smoke+1)^1)

Coefficients:
Intercept    smoke.1  
  -3.3988     0.7283  

Degrees of Freedom: 301716 Total (i.e. Null);  301715 Residual
Null Deviance:      183100 
Residual Deviance: 179800   AIC: 179800 
glm(outcome_heart ~ smoke, family = binomial, data = heart) %>% summary()

Call:
glm(formula = outcome_heart ~ smoke, family = binomial, data = heart)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.670536   0.009734 -274.35   <2e-16 ***
smoke        0.728308   0.012896   56.47   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 179805  on 301715  degrees of freedom
AIC: 179809

Number of Fisher Scoring iterations: 5

The association between heart disease vs smoking function is statistically significant (p<2x10^-16).

#heart disease vs physical activity
heart <- heart %>%
  mutate(phys = if_else(phys_A == "Yes", 1, 0))
mfp::mfp(outcome_heart ~ fp(phys), family = binomial, data = heart)
Warning: glm.fit: algorithm did not converge
Call:
mfp::mfp(formula = outcome_heart ~ fp(phys), data = heart, family = binomial)


Deviance table:
         Resid. Dev
Null model   183053.8
Linear model     180633.7
Final model  180633.7

Fractional polynomials:
     df.initial select alpha df.final power1 power2
phys          4      1  0.05        1      1      .


Transformations of covariates:
           formula
phys I((phys+1)^1)

Coefficients:
Intercept     phys.1  
  -1.1480    -0.6791  

Degrees of Freedom: 301716 Total (i.e. Null);  301715 Residual
Null Deviance:      183100 
Residual Deviance: 180600   AIC: 180600 
glm(outcome_heart ~ phys, family = binomial, data = heart) %>% summary()

Call:
glm(formula = outcome_heart ~ phys, family = binomial, data = heart)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.82707    0.01084 -168.57   <2e-16 ***
phys        -0.67912    0.01341  -50.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 180634  on 301715  degrees of freedom
AIC: 180638

Number of Fisher Scoring iterations: 5

The association between heart disease vs physical activity function is statistically significant (p<2x10^-16). All categorical variables show that the association with heart disease is statistically significant. It is also worth to see the overall relationship with heart disease when all the categorical variables and continuous variables (bmi, sleep) are considered together.

model2<- glm (outcome_heart ~ bmi + sleep + phys_H + diabetic_status + alc_status + smoke + phys, family=binomial, data = heart)
summary(model2)

Call:
glm(formula = outcome_heart ~ bmi + sleep + phys_H + diabetic_status + 
    alc_status + smoke + phys, family = binomial, data = heart)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -3.0351224  0.0448687 -67.645   <2e-16 ***
bmi             -0.0010043  0.0009957  -1.009    0.313    
sleep            0.0381331  0.0039993   9.535   <2e-16 ***
phys_H           0.0363439  0.0006275  57.920   <2e-16 ***
diabetic_status  1.0928719  0.0152434  71.695   <2e-16 ***
alc_status      -0.5640942  0.0318038 -17.737   <2e-16 ***
smoke            0.6212631  0.0133844  46.417   <2e-16 ***
phys            -0.2858621  0.0146181 -19.555   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 168069  on 301709  degrees of freedom
AIC: 168085

Number of Fisher Scoring iterations: 5

All the variables besides BMI show that they are statistically significant associated with heart disease. However, when doing the heart disease vs BMI alone (bivariate relationship), it showed statistically significant association between bmi and heart disease. There could be multicollinearity where the predictors are correlated with each other making it difficult to isolate their individual effects. Thus, it is worth seeing if each BMI category has a different effect on the log-odds of the outcome compared to the reference category (in this case, the healthy bmi group). Putting BMI into groups can sometimes help to capture non-linear relationships or multicollinearity to better represent the underlying structure of the data.

model3<- glm (outcome_heart ~ bmi_group + sleep + phys_H + diabetic_status + alc_status + smoke + phys, family=binomial, data = heart)
summary(model3)

Call:
glm(formula = outcome_heart ~ bmi_group + sleep + phys_H + diabetic_status + 
    alc_status + smoke + phys, family = binomial, data = heart)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -3.1801330  0.0354484 -89.712  < 2e-16 ***
bmi_groupobese        0.1042983  0.0176469   5.910 3.41e-09 ***
bmi_groupoverweight   0.2130307  0.0173049  12.310  < 2e-16 ***
bmi_groupunderweight -0.0213053  0.0551619  -0.386    0.699    
sleep                 0.0388455  0.0040016   9.707  < 2e-16 ***
phys_H                0.0365086  0.0006276  58.174  < 2e-16 ***
diabetic_status       1.0753132  0.0152059  70.717  < 2e-16 ***
alc_status           -0.5609814  0.0318131 -17.634  < 2e-16 ***
smoke                 0.6194444  0.0133861  46.275  < 2e-16 ***
phys                 -0.2840472  0.0146049 -19.449  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 183054  on 301716  degrees of freedom
Residual deviance: 167907  on 301707  degrees of freedom
AIC: 167927

Number of Fisher Scoring iterations: 5

For individuals categorized as underweight compared to the reference category (bmi healthy group), the estimated log-odds of having heart disease decrease by -0.0213053 but it is not statistically significant (p=0.699). All estimates have significant p-values (p<2x10^-16, and p-value for bmi_groupobese is 3.41 x 10^-9) indicating evidence of association between these predictors and the likelihood of having heart disease.

Conclusion: According to CDC and aside from the key risk factors like high blood pressure, high blood cholesterol, and smoking, several other medical conditions and lifestyle choices can put people at a higher risk for heart disease which include but not limited to diabetes, physical inactivity, and overweight and obesity. Based on the analysis performed, diabetes, physical inactivity, high bmi, smoking, sleep hours, alcohol consumption, and overall physical health status are have statistically significant association to heart disease.